On a daily basis, we produce and encounter huge amounts of text data, spoken and written, in many languages. However, the only language computers understand is numbers, so to work with text efficiently we need to train computers to understand spoken and written words. This can be achieved through natural language processing (NLP). NLP gives computers the ability to understand written text and spoken words in much the same way human beings can. It enables computers to process human language in the form of text or voice data and to 'understand' its full meaning, complete with the speaker's or writer's intent and sentiment.
library(tidytext, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(stringi, warn.conflicts = FALSE)
library(plotly, warn.conflicts = FALSE)
library(qdapRegex, warn.conflicts = FALSE)
library(wordcloud, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE)
library(syuzhet, warn.conflicts = FALSE)
library(SentimentAnalysis, warn.conflicts = FALSE)
library(sentimentr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
The three data files will be imported and a sample taken for further analysis. We will remove profane words by filtering our data against the words in the profanity_txt file.
setwd("C:/Users/justi/Documents/Olu_Drive/Coursera/Data_Science_Statistics_and_Machine_Learning_Specialization/Capstone/en_US")
blogs_txt <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news_txt <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter_txt <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_txt <- readLines("profanity.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_df <- tibble(profanity_txt)
special_txt <- readLines("special.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
special_df <- tibble(special_txt)
stopWords_txt <- readLines("stopwords.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
stopWords_df <- tibble(stopWords_txt)
Here we determine basic features of the en_US.blogs, en_US.news and en_US.twitter data. The table below shows the file size, number of characters (words, spaces and others), words and lines for each file.
| File Type | File Size (MB) | Number of Characters | Number of Words | Number of Lines |
|---|---|---|---|---|
| Blogs | 200.42 | 206824505 | 37546250 | 899288 |
| News | 196.28 | 15639408 | 2674536 | 77259 |
| Twitter | 159.36 | 162096241 | 30093413 | 2360148 |
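These summary figures can be reproduced with a short sketch like the one below; since the raw corpus is not shipped with this report, a temporary two-line file stands in for e.g. en_US.blogs.txt:

```r
# Compute file size, character, word and line counts for a text file.
# A temporary file stands in for the real en_US.* data sets.
txt  <- c("Hello world", "Another line of text")
path <- tempfile(fileext = ".txt")
writeLines(txt, path)

file_size <- file.size(path)                        # size in bytes on disk
n_lines   <- length(readLines(path, warn = FALSE))  # number of lines
n_chars   <- sum(nchar(txt))                        # characters, excluding newlines
n_words   <- sum(lengths(strsplit(txt, "\\s+")))    # whitespace-separated words
c(size = file_size, lines = n_lines, chars = n_chars, words = n_words)
```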
I will sample 0.5% of each data set (blogs_txt, news_txt and twitter_txt) to form a single set, sample1_txt. See below the first 3 lines of sample1_txt.
[1] "Or put another way – in the spirit of this site’s mission – it’s all bollocks."
[2] "No Regrets for Our Youth – 0"
[3] "Tom: See you!"
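The sampling step itself is not shown above; a minimal sketch, assuming the 0.5% fraction from the text, a fixed seed for reproducibility, and small toy vectors in place of the real blogs_txt, news_txt and twitter_txt:

```r
set.seed(123)  # assumption: any fixed seed, for reproducibility
# Toy stand-ins for the real blogs_txt, news_txt and twitter_txt vectors
blogs_txt   <- sprintf("blog line %d", 1:1000)
news_txt    <- sprintf("news line %d", 1:1000)
twitter_txt <- sprintf("tweet %d", 1:1000)

sample_frac <- 0.005  # 0.5% of each source
take <- function(x) sample(x, size = ceiling(length(x) * sample_frac))
sample1_txt <- c(take(blogs_txt), take(news_txt), take(twitter_txt))
length(sample1_txt)  # 5 lines from each source, 15 in total
```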
First, we need to clean the data and remove irrelevant characters so we can concentrate on the important words from this file.
# Identify and drop lines containing non-convertible (non-ASCII) characters
latin1ASII_func <- grep("latin1ASII", iconv(sample1_txt, "latin1", "ASCII", sub = "latin1ASII"))
sample2_txt <- sample1_txt[-latin1ASII_func]
sample3_txt <- gsub("&", " ", sample2_txt) # replace ampersands
sample3_txt <- gsub("RT :|@[a-z,A-Z]*: ", " ", sample3_txt) # remove retweet markers
sample3_txt <- gsub("@\\w+", " ", sample3_txt) # remove @mentions
sample3_txt <- gsub("[[:digit:]]", " ", sample3_txt) # remove digits
sample3_txt <- gsub(" #\\S*"," ", sample3_txt) # remove hash tags
sample3_txt <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample3_txt) # remove url
sample3_txt <- gsub("[^[:alnum:][:space:]']", "", sample3_txt) # Remove punctuation except apostrophes
sample3_txt <- rm_white(sample3_txt) # remove extra spaces using `qdapRegex` package
See below the first 3 lines of the cleaned set.
[1] "Tom See you"
[2] "See it's all the fault of evolution"
[3] "But seriously Wells Youngs WHAT IS THIS BULL CRAP ABOUT NOT SELLING IT IN THE UK UNTIL NEXT YEAR Get it sorted I want to be drinking this at Christmas"
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
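Before handing this to tidytext, the idea can be illustrated in base R with a single sentence:

```r
text <- "The quick brown fox jumps over the lazy dog"
# Lower-case the text, then split on whitespace to obtain word tokens
tokens <- unlist(strsplit(tolower(text), "\\s+"))
tokens  # "the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
```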
We need both to break the text into individual tokens and to transform it to a tidy data structure; this is equivalent to a unigram \(1-gram\). We also need to filter the profane words out of the text corpus.
sample_df <- tibble(text = sample3_txt) # collect the cleaned sample in a tibble
sample_df2 <- sample_df %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% profanity_df$profanity_txt) %>% # remove profane words
  filter(!word %in% special_df$special_txt) %>% # remove special tokens
  drop_na()
The count() function will be useful here; it will help us visualize the data set. See below the five most frequent words and their frequencies \(n\).
unigram <- sample_df2 %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  filter(n > 10)
head(unigram, 5)
# A tibble: 5 x 2
word n
<fct> <int>
1 the 10521
2 to 7141
3 i 5938
4 a 5843
5 and 5626
We use ggplot to generate the histogram and line graph below, showing words that occur more than 800 times.
# Sentiment and Emotion Analysis

Sentiment analysis is used to systematically identify, extract, quantify, and study affective states and subjective information in text data; it helps us understand the social sentiment in the data. Emotion analysis identifies and analyzes the underlying emotions expressed in the data, such as good or bad, sad or happy.
The pie chart shows that most of the sentiments expressed in the sample text are positive.
sentiment_txt <- sample_df2 %>%
  filter(!word %in% profanity_df$profanity_txt) %>%
  filter(!word %in% special_df$special_txt) %>%
  anti_join(stop_words)
sentiment_txt <- sentiment_txt$word
sentiment_df <- analyzeSentiment(sentiment_txt)
# Save data to r object
saveRDS(sentiment_df, "sentiment_df.rds")
# Extract dictionary-based sentiment according to the QDAP dictionary
SentimentQDAP_df <- sentiment_df$SentimentQDAP
# View sentiment direction (i.e. positive, neutral and negative)
sentimentDirection_char <- convertToDirection(SentimentQDAP_df)
sentimentDirection_df <- data.frame("SentimentDirection" = sentimentDirection_char)
# Combine sentiment direction with SentimentQDAP in a data set
sentimentDirection_df$SentimentQDAP <- sentiment_df$SentimentQDAP
# Draw a pie chart
sentimentDirection_df %>%
  drop_na() %>%
  ggplot(aes(x = "", y = SentimentDirection, fill = SentimentDirection)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_void()
The histogram below reveals the emotions expressed in the data set.
emotion_txt <- sample_df2 %>%
  filter(!word %in% profanity_df$profanity_txt) %>%
  filter(!word %in% special_df$special_txt) %>%
  anti_join(stop_words)
emotion_txt <- emotion_txt$word
emotion_df <- setDF(emotion_by(get_sentences(emotion_txt)))
# Save data to r object
saveRDS(emotion_df, "emotion_df.rds")
emotion_df$emotionType <- as.character(emotion_df$emotion_type)
emotion_df2 <- emotion_df %>%
  select(!emotion_type) %>%
  filter(!emotionType %in% c("anticipation_negated", "fear_negated",
                             "surprise_negated", "disgust_negated",
                             "sadness_negated", "joy_negated",
                             "trust_negated", "anger_negated"))
# Histogram
ggplot(emotion_df2) +
  aes(reorder(emotionType, emotion_count), emotion_count, fill = emotionType) +
  geom_col() + # bar heights taken directly from emotion_count
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Emotion", y = "Frequency")
Essentially, the step under Tokenization above is equivalent to a \(1-gram\). From here, we visualize the data in the form of \(2-gram\), \(3-gram\) and \(4-gram\) tokens.
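Conceptually, an \(n-gram\) is just a sliding window of \(n\) consecutive tokens; a base R illustration with toy tokens (unnest_tokens does this for us below):

```r
tokens <- c("thanks", "for", "the", "follow")
n <- 2  # bigrams
# Slide a window of width n across the token vector
ngrams <- sapply(seq_len(length(tokens) - n + 1),
                 function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
ngrams  # "thanks for" "for the" "the follow"
```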
Generate \(2-gram\) tokens and remove profane words.
bigram <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ",
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>%
  unite(bigram, word1, word2, sep = " ")
See below the five most frequent bigrams and their frequencies \(n\).
# A tibble: 5 x 2
bigram n
<fct> <int>
1 in the 868
2 of the 854
3 for the 597
4 on the 477
5 to the 466
The histogram and line graph show \(2-gram\)s occurring more than 200 times.
Generate \(3-gram\) tokens and remove profane words.
trigram <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ",
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>%
  unite(trigram, word1, word2, word3, sep = " ")
See below the five most frequent trigrams and their frequencies \(n\).
# A tibble: 5 x 2
trigram n
<fct> <int>
1 thanks for the 127
2 one of the 64
3 a lot of 62
4 i want to 58
5 looking forward to 57
The histogram and line graph show \(3-gram\)s occurring more than 40 times.
Generate \(4-gram\) tokens and remove profane words.
quadgram <- sample_df %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>%
  separate(quadgram, c("word1", "word2", "word3", "word4"), sep = " ",
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word4 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt,
         !word4 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>%
  unite(quadgram, word1, word2, word3, word4, sep = " ")
See below the five most frequent quadgrams and their frequencies \(n\).
# A tibble: 5 x 2
quadgram n
<fct> <int>
1 thanks for the follow 35
2 thank you for the 20
3 the end of the 20
4 at the end of 18
5 for the first time 17
The histogram and line graph show \(4-gram\)s occurring more than 10 times.
## Quintgram
Generate \(5-gram\) tokens and remove profane words.
quintgram <- sample_df %>%
  unnest_tokens(quintgram, text, token = "ngrams", n = 5) %>%
  separate(quintgram, c("word1", "word2", "word3", "word4", "word5"), sep = " ",
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word4 %in% profanity_df$profanity_txt,
         !word5 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt,
         !word4 %in% special_df$special_txt,
         !word5 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>%
  unite(quintgram, word1, word2, word3, word4, word5, sep = " ")
See below the five most frequent quintgrams and their frequencies \(n\).
# A tibble: 5 x 2
quintgram n
<fct> <int>
1 the santelena hotel venice italy 11
2 at pates and fountain parks 10
3 at the end of the 9
4 classic at pates and fountain 8
5 in the middle of the 7
We will use \(2-gram\), \(3-gram\) and \(4-gram\) tokens to build the required model for next-word prediction.
Split each \(N-gram\) into its constituent words and store them back in the same data frame.
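A base R sketch of this split-and-look-up step, using a toy table as a stand-in for the real bigram data frame:

```r
# Toy bigram table, a stand-in for the real bigram data frame
tab <- data.frame(bigram = c("good morning", "good luck", "good day"))
# Split each bigram into its constituent words
parts <- do.call(rbind, strsplit(tab$bigram, " ", fixed = TRUE))
tab$word1 <- parts[, 1]
tab$word2 <- parts[, 2]
# Next-word lookup: second words of rows whose first word is "good"
tab$word2[tab$word1 == "good"]  # "morning" "luck" "day"
```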
Find the next word for "good" by filtering the data frame for rows whose first word is "good".
[1] "morning" "to" "luck" "day"
The possible next words are as shown above. We will build a more reliable model going forward.
First, we will create matching functions for each \(N-gram\).
# Bigram matching
bigram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(bigram_df, word1 == inputWords[num]) %>%
    add_count(word2, sort = TRUE) %>%
    top_n(3, n) %>%
    filter(row_number() == nRow) %>%
    select(num_range("word", 2)) %>%
    as.character() -> out
  # No match: return "?"
  if (out == "character(0)") "?" else out
}
# Trigram matching
trigram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(trigram_df,
         word1 == inputWords[num - 1],
         word2 == inputWords[num]) %>%
    add_count(word3, sort = TRUE) %>%
    top_n(3, n) %>%
    filter(row_number() == nRow) %>%
    select(num_range("word", 3)) %>%
    as.character() -> out
  # No match: back off to the bigram model
  if (out == "character(0)") bigram_func(inputWords) else out
}
# Quadgram matching
quadgram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(quadgram_df,
         word1 == inputWords[num - 2],
         word2 == inputWords[num - 1],
         word3 == inputWords[num]) %>%
    add_count(word4, sort = TRUE) %>%
    top_n(3, n) %>%
    filter(row_number() == nRow) %>%
    select(num_range("word", 4)) %>%
    as.character() -> out
  # No match: back off to the trigram model
  if (out == "character(0)") trigram_func(inputWords) else out
}
This function will be used to predict the next word when a word or phrase is entered.
ngrams_func <- function(wordPhraseInput){
  # Create a data frame
  wordPhraseInput <- data.frame(text = wordPhraseInput)
  # Clean the input: keep letters and spaces only
  replace_reg <- "[^[:alpha:][:space:]]*"
  wordPhraseInput <- wordPhraseInput %>%
    mutate(text = str_replace_all(text, replace_reg, ""))
  # Find word count, separate words, lower case
  inputCount <- str_count(wordPhraseInput, boundary("word"))
  inputWords <- unlist(str_split(wordPhraseInput, boundary("word")))
  inputWords <- tolower(inputWords)
  # Match against the longest applicable n-gram, backing off as needed
  if (inputCount == 1) {
    bigram_func(inputWords)
  } else if (inputCount == 2) {
    trigram_func(inputWords)
  } else {
    quadgram_func(inputWords)
  }
}
Predict the next word for the following words and phrases.
ngrams_func("happy")
[1] "birthday"
ngrams_func("my new")
[1] "york"
ngrams_func("good to see")
[1] "you"
ngrams_func("just to let you")
[1] "are"
ngrams_func("thank you so")
[1] "much"